Using linguistically motivated features for paragraph boundary identification

نویسندگان

  • Katja Filippova
  • Michael Strube
چکیده

In this paper we propose a machinelearning approach to paragraph boundary identification which utilizes linguistically motivated features. We investigate the relation between paragraph boundaries and discourse cues, pronominalization and information structure. We test our algorithm on German data and report improvements over three baselines including a reimplementation of Sporleder & Lapata’s (2006) work on paragraph segmentation. An analysis of the features’ contribution suggests an interpretation of what paragraph boundaries indicate and what they depend on.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Coping imbalanced prosodic unit boundary detection with linguistically-motivated prosodic features

Continuous speech input for ASR processing is usually presegmented into speech stretches by pauses. In this paper, we propose that smaller, prosodically defined units can be identified by tackling the problem on imbalanced prosodic unit boundary detection using five machine learning techniques. A parsimonious set of linguistically motivated prosodic features has been proven to be useful to char...

متن کامل

Linguistically-motivated automatic classification of regional French varieties

The goal of this study is to automatically differentiate French varieties (standard French and French varieties spoken in the South of France, Alsace, Belgium and Switzerland) by applying a linguistically-motivated approach. We took advantage of automatic phoneme alignment to measure vowel formants, consonant (de)voicing, pronunciation variants as well as prosodic cues. These features were then...

متن کامل

Language Identification in Code-Switching Scenario

This paper describes a CRF based token level language identification system entry to Language Identification in CodeSwitched (CS) Data task of CodeSwitch 2014. Our system hinges on using conditional posterior probabilities for the individual codes (words) in code-switched data to solve the language identification task. We also experiment with other linguistically motivated language specific as ...

متن کامل

Author gender identification from text using Bayesian Random Forest

Nowadays high usage of users from virtual environments and their connection via social networks like Facebook, Instagram, and Twitter shows the necessity of finding out shared subjects in this environment more than before. There are several applications that benefit from reliable methods for inferring age and gender of users in social media. Such applications exist across a wide area of fields,...

متن کامل

Using linguistically-defined specific details to detect deception across domains

Current automatic deception detection approaches tend to rely on cues that are based either on specific lexical items or on linguistically abstract features that are not necessarily motivated by the psychology of deception. Notably, while approaches relying on such features can do well when the content domain is similar for training and testing, they suffer when content changes occur. We invest...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006